In many scientific settings data can be naturally partitioned into
variable groupings called views. Common examples include environ- 
mental (1st view) and genetic information (2nd view) in ecological 
applications, chemical (1st view) and biological (2nd view) data in 
drug discovery. Multi-view data also occur in text analysis and pro- 
teomics applications where one view consists of a graph with obser- 
vations as the vertices and a weighted measure of pairwise similar- 
ity between observations as the edges. Further, in several of these 
applications the observations can be partitioned into two sets, one
where the response is observed (labeled) and the other where the response
is not (unlabeled). The problem of simultaneously addressing
viewed data and incorporating unlabeled observations in training is
referred to as multi-view transductive learning. In this work we introduce
and study a comprehensive generalized fixed point additive
modeling framework for multi-view transductive learning, where any 
view is represented by a linear smoother. The problem of view se- 
lection is discussed using a generalized Akaike Information Criterion, 
which provides an approach for testing the contribution of each view. 
An efficient implementation is provided for fitting these models with 
both backfitting and local-scoring type algorithms adjusted to semi- 
supervised graph-based learning. The proposed technique is assessed 
on both synthetic and real data sets and is shown to be competitive 
to state-of-the-art co-training and graph-based techniques. 

1. Introduction. In many scientific applications the available data come 
from diverse domains which are referred to as views henceforth. The views 
may consist of collections of numerical and categorical variables, but also 
may correspond to observed graphs. The objective of this study is to in- 
troduce a comprehensive modeling framework for a numerical or categor- 
ical response variable that is a function of data from distinct views. As 
a motivating example, consider a collection of documents belonging to a
particular scientific domain, for example, papers in statistics journals. The 
available information about the documents can be organized in the follow- 
ing three views: the corpus of the documents, that is, a collection of words 
in the documents [Blum and Mitchell (1998)]; information describing the 
documents (e.g., title, author, journal, etc.) [McCallum et al. (2000)]; and 
the co-citation network (graph) [McCallum et al. (2000), Neville and Jensen 
(2005)]. For the graph, nodes correspond to documents (observations) and 
edges count the number of citations to the same papers (pairwise similarity).
The goal for this problem is to classify a document according to an attribute
(e.g., whether a paper is applied or theoretical) where the attribute
is known (labeled) for only a subset of the documents, with the remainder 
being unknown (unlabeled). In this context, the documents must be labeled 
by human action, whereas the view information can be obtained in an automated
fashion (i.e., the set of labeled observations $L$ is significantly smaller
than the unlabeled one; $|L| \ll |U|$). Further, it is worth noting that the first
two views can be structurally represented by a data matrix with rows cor- 
responding to observations (documents) and columns to variables, but the 
third view is given directly in the form of an observed graph. 

Another example of multi-view data arises in drug discovery applications. 
Suppose that a very large number of characteristics (e.g., > 1000) has been 
collected for a library of chemical compounds. These characteristics range 
from high throughput screening measurements of compounds' effectiveness 
against numerous biological targets [Lundblad (2004), Hunter (1995)] to 
a compound's absorption, distribution, metabolism, excretion and toxicity
(ADMET) properties [Fox et al. (1959), Kansy, Senner and Gubernator
(2001)]. Further, given the chemical structure of a compound, it is nowadays
fairly easy to computationally measure physical properties of each com- 
pound [Leach and Gillet (2003)]. Given data on the response of a subset 
of compounds in a library for a particular target (e.g., whether or not a 
side-effect is associated with the compound), the goal is to use the data 
available in these diverse views (biological, chemical, ADMET) to predict 
the response for the remaining members in the library. Notice also that the 
target status of a potential drug can be both time consuming and hard to 
determine (e.g., side effects in humans may take many years to appear), 
whereas the biological and chemical compound characteristics can be ob- 
tained in a shorter time period (usually days to weeks) and with less effort 
(hence, $|L| \ll |U|$). Other examples of multi-view problems are present in
applications involving genomic [Nabieva et al. (2005)] and proteomic data 
[Yamanishi, Vert and Kanehisa (2004)]. 

As illustrated with these examples, the available data can be naturally 
partitioned into disjoint data sets, referred to as views, that in some cases 
can be represented as data matrices, while in other cases come in the form 
of graphs. Views comprised of variables can differ in the number of variables,
variable type (numerical, ordinal, nominal), noise level and scale.
Graph views may differ in the node degree distribution, type and distri- 
bution of edge weights. Traditionally, models for the prediction problem at 
hand have been built that include all the variables available, without taking 
into consideration the presence of distinct views. Further, data in the form 
of an observed graph were ignored completely. Popular techniques for build- 
ing flexible prediction models include recursive partitioning [Breiman et al. 
(1984)], multivariate adaptive regression splines [Friedman (1991)], random 
forests [Breiman (2001)], support vector machines [Vapnik (1998)], partial 
least squares [Mevik and Wehrens (2007)], etc. Nevertheless, there are several
situations where incorporating the distinction between views offers a number of
advantages from a data analysis point of view, including:

• View level analysis: In many applications it is of great interest to develop a 
model that provides insight into the underlying relationship among views, 
potentially identifying interactions between them, and also to assess their 
predictive capabilities. The latter can prove particularly useful in problems 
where collecting the necessary data for a view may be resource demanding 
and thus expensive. 

• Incorporation of graph information: As already discussed, in many applications
some of the available data come in the form of a graph that
available statistical models cannot handle in a straightforward manner.

• Improving predictive performance: Allowing the available data to be par- 
titioned into different views and incorporating interactions among them 
offers the advantage of building more flexible and potentially more power- 
ful models, exhibiting better performance in terms of prediction accuracy. 

In this work we introduce an additive modeling framework that takes into 
consideration the presence of distinct data views. Further, it incorporates 
in a seamless fashion observed graphs, allows for view level analysis and on 
many occasions leads to significant gains in performance. 

The main idea of this framework is to represent each view by a linear 
smoother. The difficulty is in providing representative multi-dimensional
view smoothers on any data type (graphs or numerical/categorical), while
accounting for the sparse labeling of the response which occurs in several of
the applications under consideration ($|L| \ll |U|$). To define a smoother for
an observed graph, we build on recent advances in graph-based transduc- 
tive learning [Blum and Mitchell (1998), Zhu (2007)]. Specifically, graph- 
based transductive learning addresses the problem of learning in a setting 
where the available data come in the form of a graph (labeled and unla- 
beled observations correspond to vertices/nodes and pairwise associations 
to edges), where a numerical or categorical variable can also be associated 
with each node on the graph. In this context, Culp and Michailidis (2008a)




M. CULP, G. MICHAILIDIS AND K. JOHNSON



note that the adjacency matrix of the appropriately normalized graph leads
to a stochastic matrix that resembles a kernel smoother with a transductive 
form defined on both labeled and unlabeled nodes. In this work we define a 
transductive smoother in general as a linear smoother defined for a response 
that has a missing unlabeled component. In the case of numerical, categorical 
or ordinal data views, it is fairly straightforward to extend a classical linear 
smoother [see, e.g., Hastie and Tibshirani (1990)] into a transductive one 
[Culp and Michailidis (2008a)]. Upon obtaining the transductive smoother
for each view, the next challenge is in fitting a model to a smoother of 
this form, since the smoother is linear in the response partitioned with a 
labeled (observed) and unlabeled (missing or unobserved) component. To 
address this, we propose a novel generalized fixed point self-training frame- 
work (Section 2) that essentially extends the classical generalized additive 
model into the multi-view transductive setting. Under reasonable conditions 
on the transductive smoother, a unique solution is guaranteed to exist. In
addition, the computational issues are addressed using established iterative 
self-training procedures for both the regression and classification settings 
[Culp and Michailidis (2008a), Zhu (2007)].

The proposed modeling framework treats both the variable and graph 
views represented by the transductive smoothers as the equivalent of "vari- 
ables" in a generalized additive model which can subsequently be fitted 
by an extension of the common backfitting (local scoring) algorithm to self- 
training. Due to the linearity of the solution in the response variable, existing 
model selection techniques can be readily applied to select important views. 
Also, the smoothers require estimation of underlying parameters, as in the 
classical case, and we investigate a criterion more appropriate for transduc- 
tive smoothers defined on views. The results indicate that the multi-view 
model using this estimation approach is quite competitive with the state-of- 
the-art multi-view techniques discussed next. 

1.1. Relevant existing multi-view learning approaches. We provide next 
a brief exposition of existing approaches geared toward improving accuracy 
in multi-view learning problems. 

It is natural to consider the general semi-supervised classification prob- 
lem as a precursor to the multi-view setting. In semi-supervised learning a 
relatively small percentage of the observations (cases) contain labels. The 
objective is to use the labeled cases and their relation to the unlabeled cases 
to complete the labeling of the data. Upon label completion, the classifier 
can be used to predict new cases (inductive) or must be retrained/updated 
(transductive) to incorporate this information into the classifier. Various al- 
gorithmic solutions available for this problem include self-training [Abney 
(2004)], graph regularization [Wang and Zhang (2006)], semi-supervised SVM 
[Chapelle, Sindhwani and Keerthi (2008)], and parametric models
[Krishnapuram et al. (2005)]. For example, in Zhu, Ghahramani and Lafferty
(2003) the authors propose a quadratic energy optimization problem lead- 
ing to a harmonic estimate for the unlabeled data with the constraint that 
their labeled estimate retains the original labels. This approach has several 
connections with electrical circuits [Zhu, Ghahramani and Lafferty (2003)], 
st-mincut clustering techniques [Blum and Chawla (2001)], spectral kernel
techniques [Joachims (2003), Johnson and Zhang (2007)], and kernel
smoothing approaches [Lafferty and Wasserman (2007), Culp and Michailidis
(2008)]. The survey by Zhu (2007) and the book by Chapelle, Schölkopf and Zien
(2006) highlight several of these semi-supervised approaches and address 
both theoretical and practical issues. 

In multi-view learning Blum and Mitchell (1998) developed a co-training 
procedure for classification problems that is based on the idea that better 
predictive models can be found at the individual view level, rather than fit- 
ting a model directly on all the available views. The co-training procedure 
trains a separate classifier for each view and then proceeds in a self-training 
fashion by iteratively treating the most confident unlabeled observations as 
true labeled ones using the fitted class estimates as the true values. After 
a prespecified number of iterations, co-training produces a classification of 
every observation in the data. A final classifier can be formed through a 
combination of the individual view classifiers, in order to predict new ob- 
servations unavailable during the training phase. The intuition behind this 
approach is that if the results of the individual classifiers arrive at the same 
classification for either a labeled observation (known response) or, more im- 
portantly, an unlabeled observation (unknown response), and the views are 
conditionally independent, then it is highly likely that the derived classification
is correct [Blum and Mitchell (1998), Abney (2002)].

Another set of procedures is based on a transductive graph-based learning
setting [Joachims (2003), Zhu (2007), Culp and Michailidis (2008)], where
the underlying graph is either observed directly or constructed from the data 
matrix. In this setting, each view can be captured by a graph with its cor- 
responding adjacency matrix, and the views are integrated by adding their 
adjacency matrices. For example, the Spectral Graph Transducer (SGT) 
treats the resulting graph as an energy network, where the labeled obser- 
vations are positive or negative sources and the objective is to determine 
an optimal energy estimate for the unlabeled responses [Joachims (2003)]. 
An approach that shares some of the SGT's characteristics is the Sequential 
Predictions Algorithm (SPA), which forms the final graph by using graph 
theoretic operations such as unions and intersections [Culp and Michailidis
(2008)]. This procedure employs a local kernel smoothing algorithm with 
a regularized extrapolation penalty that shrinks the estimates of unlabeled 
nodes farther away from labeled ones toward the class prior distribution. 
It can be seen that such graph-based procedures can naturally incorporate
multiple views. However, views comprised of numerical variables must first
be converted into graphs, which may not be the most effective way of rep- 
resenting the data. Further, high performance classifiers such as support 
vector machines or random forests cannot be used in this setting. 

The proposed modeling framework shares some features with existing co- 
training and graph based approaches. However, by building a smoother for 
each view and then combining them through a generalized additive model, 
the proposed approach offers useful tools such as view level analysis and 
incorporation of graph terms, together with performance improvements. The 
remainder of the paper is organized as follows: In Section 2 we introduce 
the modeling framework and address estimation and model selection issues. 
Section 3 illustrates the model on a number of real and simulated data sets. 
Some concluding remarks are drawn in Section 4. 

2. Modeling framework for multi-view data. For the problem at hand, 
let $Y$ denote the response variable of length $n = |L \cup U|$, partitioned into
the set $L$ of labeled observations and $U$ of unlabeled ones (specifically $Y_U$
is missing with $|U| > 0$), that is, $Y = [Y_L', Y_U']'$. The available predictors can
be partitioned into $q$ distinct views, where views may consist of variables,
observed graphs or both. Views comprised of numerical, nominal or ordinal
variables are represented by data matrices $X_\ell$ of size $n \times p_\ell$. It is assumed
that a particular variable can only be present in one view. Each individual
data matrix can be partitioned row-wise into two disjoint labeled and unlabeled
sets: $X_\ell = [X_{\ell L}', X_{\ell U}']'$. Views can also correspond to observed graphs,
$G_\ell = (N_\ell, E_\ell)$, with $N_\ell = L \cup U$ denoting the node set and $E_\ell$ the edge set.
The similarity weighted $n \times n$ adjacency matrix $A_\ell$ for observed graph $G_\ell$
can also be partitioned in the following way:

$$A_\ell = \begin{pmatrix} A_{\ell LL} & A_{\ell LU} \\ A_{\ell UL} & A_{\ell UU} \end{pmatrix},$$

with $A_{\ell LL}$, $A_{\ell LU}$, $A_{\ell UU}$ representing edges between labeled nodes, between
labeled and unlabeled nodes and between unlabeled nodes, respectively.

As noted above, the response variable is partitioned into a labeled and 
unlabeled component, which induces the corresponding partitions to X and 
G, respectively. The proposed modeling framework accommodates multiple 
views, as well as their interactions, as follows: 






where $\eta = g(\mu)$ denotes the link function of the response $Y = [Y_L', Y_U']'$ for
which $Y_U$ is missing, with $E(Y) = \mu$, $\alpha$ is an intercept term, $\{f_\ell(\cdot)\}$ are



MULTI-VIEW LEARNING WITH ADDITIVE MODELS 




smooth functions defined on the feature space $X$, and $\{f_\ell(\cdot)\}$ are smooth
functions defined on the nodes of $G$ [Culp and Michailidis (2008a), Zhu
(2005)].

The main difficulty with (2.1) stems from the transductive nature of the
model, due to the presence of the missing response vector $Y_U$. To fit (2.1), we
propose next a two stage optimization framework referred to as generalized
fixed point self-training. For this approach, we must first define the training
response as $Y_{Y_U} = [Y_L', Y_U']'$ with $g(Y_U) \in \mathbb{R}^{|U|}$ an arbitrary initialization. The
training response is then employed in two stages, each discussed in detail
next, to obtain an estimate $\hat{Y} = [\hat{Y}_L', \hat{Y}_U']' = g^{-1}(\hat{\eta})$.

In the first stage, the training response $Y_{Y_U}$ is employed to determine an
estimate for $\eta = \alpha + \sum_\ell f_\ell$ by solving

(2.2) $\min_{f} L(Y_{Y_U}, g^{-1}(\eta)) + J(f)$,

where $L(y, f)$ is a loss function that increases as the deviance between $y$ and
$f$ increases, and $J(f)$ is an appropriate penalty term on $f$. The key issue is
that of existence and uniqueness of the resulting estimate $\hat{\eta}(Y_U)$ as a function
of $Y_U$. From this perspective, it can be seen that the posited problem is
a "supervised" one with respect to response $Y_{Y_U}$ and data $X, G$. As a result,
there are a number of well-known approaches that lead to a unique solution,
including SVMs, logistic regression, additive models, neural nets, etc.
[Hastie, Tibshirani and Friedman (2001)]. Upon completing the first stage
an estimate $\hat{Y}_{Y_U} = g^{-1}(\hat{\eta}(Y_U))$ is obtained for the entire response vector $Y$
as a function of $Y_U$.

The second stage deals with the problem of optimally determining an
appropriate value of $Y_U$ necessary for training purposes. It can be chosen as
the solution to the following optimization problem:

(2.3) $\min_{Y_U} (g(Y_U) - \hat{\eta}_U(Y_U))'(g(Y_U) - \hat{\eta}_U(Y_U))$,

that is, the deviance between $g(Y_U)$ and $\hat{\eta}_U(Y_U)$ is minimized. A moment
of reflection shows that the optimal $Y_U$ corresponds to a fixed point. Existence
and uniqueness of the fixed point $Y_U = g^{-1}(\hat{\eta}_U(Y_U))$ are a key issue
[Kakutani (1941)]. In several cases the solution can be obtained in a direct
manner, given the form of $\hat{\eta}(\cdot)$. However, in other circumstances the fixed
point solution must be approximated. One way to approximate it is to use
Newton's method, whose $k$th update step is given by

$$Y_U^{(k+1)} = Y_U^{(k)} - \Big(I - \nabla g^{-1}(\hat{\eta}_U(Y_U))\Big|_{Y_U^{(k)}}\Big)^{-1}\Big(Y_U^{(k)} - g^{-1}(\hat{\eta}_U(Y_U^{(k)}))\Big).$$

A key assumption is that the maximum eigenvalue of the gradient $\nabla g^{-1}(\hat{\eta}_U(\cdot))$
is less than one, which renders the corresponding map a contraction, thus
guaranteeing the existence of a fixed point. By the derivative chain rule,
this approach requires the gradient of $\hat{\eta}_U(Y_U^{(k)})$ for each $k$, which can be
computationally demanding to obtain. This motivates the following slower
iterative self-training algorithm [Culp and Michailidis (2008a)]:

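1. Initialize the unlabeled response vector, $Y_U^{(0)}$, and set the tolerance $\delta$.

2. Iterate until $\|\hat{\eta}^{(k)} - \hat{\eta}^{(k-1)}\| < \delta$:

(a) Solve (2.2) with response $Y_{Y_U^{(k-1)}} = [Y_L', Y_U^{(k-1)\prime}]'$ and data $\{X_\ell, G_\ell\}$
to get $\hat{\eta}^{(k)}$.

(b) Set $Y_U^{(k)} = g^{-1}(\hat{\eta}_U^{(k)})$.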

Convergence of this algorithm provides an approximation to the fixed point
defined in (2.3) by construction. Whenever there exists an initialization
that results in local convergence of the above procedure, then the fixed
point exists and is approximated by the algorithm. Moreover, if the algorithm
converges globally independent of the initialization, then the fixed
point is uniquely approximated by the procedure. The global convergence
depends on the specific choices for $\{f_\ell(\cdot)\}$ [Culp and Michailidis (2008a)].
We provide below the details for fitting this procedure first for squared error
loss and then for logistic loss when $\{f_\ell(\cdot)\}$ are estimated using transductive
smoothers.
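
The two steps of the self-training algorithm can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `self_train`, `fit_eta` and `inv_link` are hypothetical, the first-stage fit is taken to be a toy row-stochastic linear smoother with identity link, and the direct solve used for comparison relies on the fact that, for a linear smoother, the fixed point satisfies $(I - S_{UU})Y_U = S_{UL}Y_L$.

```python
import numpy as np

def self_train(fit_eta, inv_link, y_l, n_u, tol=1e-10, max_iter=10_000):
    """Fixed point self-training: repeat (a) fit, (b) update Y_U."""
    n_l = len(y_l)
    y_u = np.zeros(n_u)                        # arbitrary initialization Y_U^(0)
    eta_prev = None
    for k in range(1, max_iter + 1):
        y_train = np.concatenate([y_l, y_u])   # training response [Y_L, Y_U^(k-1)]
        eta = fit_eta(y_train)                 # step (a): supervised fit on all n cases
        y_u = inv_link(eta[n_l:])              # step (b): Y_U^(k) = g^{-1}(eta_U^(k))
        if eta_prev is not None and np.linalg.norm(eta - eta_prev) < tol:
            return eta, y_u, k
        eta_prev = eta
    raise RuntimeError("no convergence within max_iter")

# Hypothetical toy fit: a row-stochastic linear smoother with identity link.
rng = np.random.default_rng(0)
n, n_l = 8, 3
W = rng.random((n, n)) + 0.1                   # strictly positive similarities
S = W / W.sum(axis=1, keepdims=True)           # rows sum to one
y_l = np.array([1.0, -1.0, 2.0])

eta, y_u, n_iter = self_train(lambda y: S @ y, lambda e: e, y_l, n - n_l)

# Direct solution of the same fixed point, for comparison.
S_UL, S_UU = S[n_l:, :n_l], S[n_l:, n_l:]
y_u_direct = np.linalg.solve(np.eye(n - n_l) - S_UU, S_UL @ y_l)
print(n_iter, np.max(np.abs(y_u - y_u_direct)))
```

Because every row of the toy smoother places positive weight on the labeled cases, the unlabeled block is a contraction and the iteration converges from any starting value; with a different first-stage fit this need not hold.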

Before we discuss transductive smoothers, we briefly address the bias and
variance of an estimate $\hat{f}$ resulting from the proposed fixed point self-training
approach in the regression context. To begin, we consider first the supervised
case with the goal of estimating a function $f_L$ from data $(X_L, Y_L)$, where
the response $Y_L$ is continuous. It is well known that the supervised error of
$\hat{f}_L$ with respect to response $Y_L$ can be decomposed into bias, variance and
irreducible terms [Hastie, Tibshirani and Friedman (2001)]. Recall that the
semi-supervised problem using the fixed point self-training method arrives at
an estimate $\hat{f} = [\hat{f}_L', \hat{f}_U']'$ with data $(Y_{Y_U}, X)$. From this, the error of estimate
$\hat{f}$ with respect to response $Y_{Y_U}$ can be decomposed as

$$\mathrm{Error}(\hat{f}) = \sigma^2 + \mathrm{Bias}(\hat{f}_L \mid X) + \mathrm{Var}(\hat{f}_L \mid X) + \mathrm{Error}(\hat{f}_U),$$

where $\sigma^2$ is the irreducible error term, $\mathrm{Bias}(\hat{f}_L \mid X)$ and $\mathrm{Var}(\hat{f}_L \mid X)$ are
the respective bias and variance of the labeled estimate relative to the true
function conditioned on the full data $X$, and $\mathrm{Error}(\hat{f}_U) = 0$ by construction.
The resulting supervised and semi-supervised bias and variance terms are
given by

Training        Bias/Variance

Supervised      Error $= \sigma^2 + \mathrm{Bias}(\hat{f}_L \mid X_L) + \mathrm{Var}(\hat{f}_L \mid X_L)$

Self-training   Error $= \sigma^2 + \mathrm{Bias}(\hat{f}_L \mid X) + \mathrm{Var}(\hat{f}_L \mid X)$
Therefore, the self-training approach allows one to determine a labeled estimate,
$\hat{f}_L$, that balances the bias/variance tradeoff by employing the entire $X$
information, whereas the corresponding supervised problem achieves a similar
goal by only using the $X_L$ information. Naturally, this decomposition
extends in the presence of graph views.

Remark 1. The connection between the two stage optimization ap- 
proach and the self-training algorithm reveals that this framework is a semi- 
supervised example of a block relaxation algorithm [see the discussion in 
Leeuw (1994)]. 

2.1. Fitting the additive model in the regression context. For ease of pre- 
sentation, we study the simplest form of (2.1) using the fixed point self- 
training approach that combines a numerical/categorical feature set X and 
a graph view $G$ for a continuous response $Y_L$. The resulting model is given
by 

(2.4) $Y = \alpha + f_1(X) + f_2(G) + \varepsilon$.

Our strategy in fitting this model is based on constructing transductive 
smoothers for both the X and G views. We provide next the details of such 
a construction and extend it to incorporate interactions among these two 
views. The implementation details are presented in Section 2.4. 

To fit the function $\eta = \alpha + f_1(X) + f_2(G)$ under a squared-error loss criterion,
we must minimize with respect to $f = [f_1', f_2']'$ the following:

(2.5) $\min_{f} (Y_{Y_U} - \eta)'(Y_{Y_U} - \eta) + \sum_{\ell=1}^{2} \lambda_\ell f_\ell' P_\ell f_\ell$,

where the $P_\ell$'s are penalty matrices for each view, $\hat{\alpha} = \bar{Y}_{Y_U}$, and $\lambda_\ell > 0$ the associated
tuning parameters. For a graph view we choose $P$ as the combinatorial
Laplacian operator, that is, P = D — A, where A is the adjacency matrix and 
D is its row-sum diagonal matrix. For X data views, there are many choices, 
including generalized additive models, spline-based models, nonparametric 
models, etc. [Hastie and Tibshirani (1990)]. Below we provide more details 
on the penalty matrix for the graph case (Section 2.1.1) and for the feature 
X data case (Section 2.1.2). No matter how it is obtained, each penalty
matrix admits the following partition:

(2.6) $P = \begin{pmatrix} P_{LL} & P_{LU} \\ P_{UL} & P_{UU} \end{pmatrix}$.

In the above expression, the submatrix $P_{LL}$ captures associations between
labeled portions of the data, $P_{UL}$ and $P_{LU}$ capture labeled to unlabeled data
associations, and $P_{UU}$ captures unlabeled to unlabeled ones. Each penalty
matrix, $P_\ell$, is assumed to be positive semidefinite, which in turn implies
that the problem in (2.5) is jointly convex. Therefore, one can solve it to
obtain the following equations:



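(2.7) $\hat{f}_\ell - S_\ell\big[Y_{Y_U} - \sum_{j \neq \ell} \hat{f}_j\big] = 0$ for $\ell = 1, 2$ with $S_\ell = C(I + \lambda_\ell P_\ell)^{-1}$,

where $C = (I - \mathbf{1}\mathbf{1}'/n)$ is a centering matrix. This is clearly an extension of
the Gauss-Seidel algorithm with response $Y_{Y_U}$ and smoothers $\{S_\ell\}_{\ell=1}^{2}$:

(2.8) $\begin{pmatrix} I & S_1 \\ S_2 & I \end{pmatrix} \begin{pmatrix} \hat{f}_1(X) \\ \hat{f}_2(G) \end{pmatrix} = \begin{pmatrix} S_1 Y_{Y_U} \\ S_2 Y_{Y_U} \end{pmatrix}$.

The solution of the Gauss-Seidel algorithm is well known to take the form
$\hat{\eta}(Y_U) = \hat{\alpha} + \hat{f}_1(X) + \hat{f}_2(G) = R Y_{Y_U}$ with smoother $R$ independent of $Y_{Y_U}$
[Hastie and Tibshirani (1990)]. Therefore, the first stage in our model fitting
strategy results in a linear fitting technique, $\hat{\eta}(Y_U) = R Y_{Y_U}$.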
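
As a hedged illustration of the first stage, the following Python sketch runs Gauss-Seidel (backfitting) sweeps for two view smoothers and compares the result with a direct solve of the block system (2.8). The helper names (`laplacian_smoother`, `backfit`, `random_graph`) and the toy data are assumptions made for this example; both views are represented here by Laplacian-based smoothers of the form $S_\ell = C(I + \lambda_\ell P_\ell)^{-1}$ built on randomly generated dense graphs.

```python
import numpy as np

def laplacian_smoother(A, lam):
    """S = C (I + lam * P)^{-1}, with P = D - A the combinatorial Laplacian."""
    n = A.shape[0]
    P = np.diag(A.sum(axis=1)) - A
    C = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return C @ np.linalg.inv(np.eye(n) + lam * P)

def backfit(S1, S2, y, tol=1e-12, max_iter=10_000):
    """Gauss-Seidel sweeps for f1 = S1 (y - f2), f2 = S2 (y - f1)."""
    f1 = np.zeros_like(y)
    f2 = np.zeros_like(y)
    for _ in range(max_iter):
        f1_new = S1 @ (y - f2)
        f2_new = S2 @ (y - f1_new)             # uses the freshly updated f1
        change = np.linalg.norm(f1_new - f1) + np.linalg.norm(f2_new - f2)
        f1, f2 = f1_new, f2_new
        if change < tol:
            break
    return f1, f2

def random_graph(rng, n):
    """Hypothetical dense symmetric weight matrix with zero diagonal."""
    W = rng.random((n, n))
    A = (W + W.T) / 2
    np.fill_diagonal(A, 0.0)
    return A

rng = np.random.default_rng(1)
n = 10
S1 = laplacian_smoother(random_graph(rng, n), lam=0.5)
S2 = laplacian_smoother(random_graph(rng, n), lam=2.0)
y = rng.normal(size=n)                         # stands in for the training response

f1, f2 = backfit(S1, S2, y)

# Direct solve of the block system (2.8).
I = np.eye(n)
lhs = np.block([[I, S1], [S2, I]])
rhs = np.concatenate([S1 @ y, S2 @ y])
f_direct = np.linalg.solve(lhs, rhs)

eta = y.mean() + f1 + f2                       # intercept estimate plus the two view fits
print(np.max(np.abs(np.concatenate([f1, f2]) - f_direct)))
```

Both smoothers in this toy example are symmetric with eigenvalues in $[0, 1)$, which is why the sweeps converge quickly; the fitted $\hat{\eta}$ is linear in the training response, as stated above.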

For the fixed point step (2.3), we need to define the class of transductive
smoothers generated from $X$, $G$ or both as follows:

$$\mathcal{A}[\cdot] = \{S : S \text{ is an } n \times n \text{ linear smoother matrix constructed from source } \cdot \text{ such that } \rho(S_{UU}) < 1\},$$

where $S \in \mathcal{A}[\cdot]$ admits the partition

(2.9) $S = \begin{pmatrix} S_{LL} & S_{LU} \\ S_{UL} & S_{UU} \end{pmatrix}$,

and $\rho(\cdot)$ denotes the spectral radius of the matrix under consideration. The
partitions in the transductive smoother correspond to the partitions in the
penalty matrix given above, that is, (2.6). From the first optimization problem
discussed above, we get that for any smoother $S$ the unlabeled estimate
is given by $\hat{\eta}_U(Y_U) = S_{UL}Y_L + S_{UU}Y_U$. The optimization problem in (2.3)
subsequently reduces to

$$\min_{Y_U} \big((I - S_{UU})Y_U - S_{UL}Y_L\big)'\big((I - S_{UU})Y_U - S_{UL}Y_L\big),$$

and therefore, the condition that $\rho(S_{UU}) = \rho(\nabla\hat{\eta}_U(\cdot)) < 1$ results in
$\hat{Y}_U = (I - S_{UU})^{-1}S_{UL}Y_L$ as the unique fixed point. From this, in the case of a
regression model a closed form solution can be obtained as

(2.10) $\hat{Y} = \begin{pmatrix} \hat{Y}_L \\ \hat{Y}_U \end{pmatrix} = \begin{pmatrix} S_{LL} + S_{LU}(I - S_{UU})^{-1}S_{UL} \\ (I - S_{UU})^{-1}S_{UL} \end{pmatrix} Y_L$.

As expected, the resulting predicted responses are linear in the labeled data
$Y_L$, that is, $\hat{Y} = [M_{LL}', M_{UL}']'Y_L$, with $M_{LL}$ and $M_{UL}$ the respective $|L| \times |L|$, $|U| \times |L|$
matrices identified in parentheses in the above expression.
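
The closed form can be checked numerically on a small example. The sketch below is illustrative only: the function name `fixed_point_fit` is hypothetical, and the toy smoother is the row-stochastic matrix $D^{-1}A$ of a five-node path graph whose two endpoints are labeled with responses 0 and 1 (labeled nodes are ordered first). Under these assumptions the unlabeled estimates interpolate linearly along the path.

```python
import numpy as np

def fixed_point_fit(S, y_l):
    """Closed form fit: returns (M_LL @ Y_L, M_UL @ Y_L) for a transductive smoother S."""
    n_l = len(y_l)
    S_LL, S_LU = S[:n_l, :n_l], S[:n_l, n_l:]
    S_UL, S_UU = S[n_l:, :n_l], S[n_l:, n_l:]
    rho = np.max(np.abs(np.linalg.eigvals(S_UU)))
    if rho >= 1.0:
        raise ValueError("spectral radius of S_UU must be below one")
    M_UL = np.linalg.solve(np.eye(S_UU.shape[0]) - S_UU, S_UL)
    M_LL = S_LL + S_LU @ M_UL
    return M_LL @ y_l, M_UL @ y_l

# Path a - u1 - u2 - u3 - b with node order (a, b, u1, u2, u3).
edges = [(0, 2), (2, 3), (3, 4), (4, 1)]
A = np.zeros((5, 5))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
S = A / A.sum(axis=1, keepdims=True)           # row-stochastic smoother D^{-1} A

y_l = np.array([0.0, 1.0])                     # responses at the labeled endpoints a and b
y_l_hat, y_u_hat = fixed_point_fit(S, y_l)
print(y_u_hat)   # approximately [0.25, 0.5, 0.75]
print(y_l_hat)   # approximately [0.25, 0.75]
```

Note that the labeled fits are smoothed as well: each endpoint has a single unlabeled neighbor, so its fitted value equals that neighbor's estimate.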



A special case arises when $f_1(X)$ is modeled by a linear model, that is,
$f_1(X) = X\beta$. The additive model in (2.4) reduces to the following semi-parametric
model:

$$Y = X\beta + f_2(G) + \varepsilon,$$

which is fit using transductive smoothers, $S_1 = H = X(X'X)^{-1}X'$ and centered
symmetric smoother $S_2 = S$ defined on $G$. The goal is to obtain a
closed form expression for $\hat{\beta}$ and $\hat{f}_2$ such that $\hat{\eta} = X\hat{\beta} + \hat{f}_2$. We first assume
that the solution uniquely exists for each $Y_U$, that is, there exists a
unique $R$ with $\hat{\eta}(Y_U) = RY_{Y_U}$. Now apply (2.8) to get that $(X'X)\hat{\beta}(Y_U) =
X'(Y_{Y_U} - \hat{f}_2(Y_U))$ and $\hat{f}_2(Y_U) = S(Y_{Y_U} - X\hat{\beta}(Y_U))$, which yields

$$\hat{\beta}(Y_U) = (X'(I - S)X)^{-1}X'(I - S)Y_{Y_U}$$

and

$$\hat{f}_2(Y_U) = S(Y_{Y_U} - X\hat{\beta}(Y_U)).$$

Therefore, the Gauss-Seidel algorithm obtains the function estimate $\hat{\eta}(Y_U) =
(I - S)X\hat{\beta}(Y_U) + SY_{Y_U} = RY_{Y_U}$. For the fixed point phase, we assume that
$\rho(R_{UU}) < 1$ to get that $\hat{Y}_U = \hat{\eta}_U(\hat{Y}_U) = X_U\hat{\beta}(\hat{Y}_U) + \hat{f}_{2U}(\hat{Y}_U)$. Profiling out
$\hat{f}_{2U}$ from $\hat{Y}_U$ and after some algebra, we get that

$$(X'(I - S)X)\hat{\beta} = X'(I - S)Y_{\hat{Y}_U} = X'(I - S)[\tilde{Y}_L + X\hat{\beta}] - X_L'(I - M_{LL})X_L\hat{\beta},$$

where $M_{LL} = S_{LL} + S_{LU}(I - S_{UU})^{-1}S_{UL}$ and $\tilde{Y}_L = [Y_L', \tilde{Y}_U']'$ with $\tilde{Y}_U = (I -
S_{UU})^{-1}S_{UL}Y_L$. Solving for the coefficient yields the following estimate for
both $\beta$ and $f_2$:

(2.11) $\hat{\beta} = (X_L'(I - M_{LL})X_L)^{-1}X_L'(I - M_{LL})Y_L$,

(2.12) $\hat{f}_{2L} = M_{LL}(Y_L - X_L\hat{\beta})$ and $\hat{f}_{2U} = M_{UL}(Y_L - X_L\hat{\beta})$,

with $M_{UL} = (I - S_{UU})^{-1}S_{UL}$. This results in $\hat{\eta}_L = (I - M_{LL})X_L\hat{\beta} + M_{LL}Y_L$,
which is the natural generalization of the classical semi-parametric modeling
result [Hastie and Tibshirani (1990)].^
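
The semi-parametric estimates (2.11) and (2.12) can be sketched as follows. This is a minimal example under assumptions, not the authors' code: `semiparametric_fit` is a hypothetical name, the graph smoother is taken to be $S = C(I + \lambda P)^{-1}$ on a randomly generated dense graph, and the data are simulated. The last lines check two properties implied by the text: plugging the fitted unlabeled values back into the full-data estimate $\hat{\beta}(Y_U)$ reproduces the same coefficient, and setting $S = 0$ recovers ordinary least squares on the labeled cases.

```python
import numpy as np

def semiparametric_fit(X, S, y_l):
    """Estimates (2.11)-(2.12): linear term X beta plus a smooth graph term."""
    n_l = len(y_l)
    X_L, X_U = X[:n_l], X[n_l:]
    S_LL, S_LU = S[:n_l, :n_l], S[:n_l, n_l:]
    S_UL, S_UU = S[n_l:, :n_l], S[n_l:, n_l:]
    M_UL = np.linalg.solve(np.eye(S_UU.shape[0]) - S_UU, S_UL)
    M_LL = S_LL + S_LU @ M_UL
    G = X_L.T @ (np.eye(n_l) - M_LL)
    beta = np.linalg.solve(G @ X_L, G @ y_l)   # (2.11)
    resid = y_l - X_L @ beta
    eta_L = X_L @ beta + M_LL @ resid          # X_L beta + f_2L from (2.12)
    eta_U = X_U @ beta + M_UL @ resid          # X_U beta + f_2U from (2.12)
    return beta, eta_L, eta_U

rng = np.random.default_rng(2)
n, n_l, p = 12, 6, 2
X = rng.normal(size=(n, p))                    # simulated feature view
W = rng.random((n, n))
A = (W + W.T) / 2                              # simulated symmetric graph weights
np.fill_diagonal(A, 0.0)
P = np.diag(A.sum(axis=1)) - A
C = np.eye(n) - np.ones((n, n)) / n
S = C @ np.linalg.inv(np.eye(n) + 0.5 * P)     # centered symmetric smoother
y_l = rng.normal(size=n_l)

beta, eta_L, eta_U = semiparametric_fit(X, S, y_l)

# Consistency check: the full-data estimate at the fixed point gives the same beta.
y_full = np.concatenate([y_l, eta_U])
XtIS = X.T @ (np.eye(n) - S)
beta_full = np.linalg.solve(XtIS @ X, XtIS @ y_full)

# Special case without a graph term (S = 0): supervised least squares.
beta_ols, _, _ = semiparametric_fit(X, np.zeros((n, n)), y_l)
beta_lstsq = np.linalg.lstsq(X[:n_l], y_l, rcond=None)[0]
print(np.max(np.abs(beta - beta_full)), np.max(np.abs(beta_ols - beta_lstsq)))
```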

In general, one can apply the Gauss-Seidel algorithm directly to a sequence
of smoothers with eigenvalues in $(-1, 1]$. The algorithm is guaranteed
to converge to a solution whenever the smoothers are diagonally dominant, 
or symmetric and positive semidefinite. In other words, one can forgo the 
first optimization problem and apply the Gauss-Seidel algorithm directly to 



^In the case of the linear model without the graph term [i.e., $f(X) = X\beta$ or $f(X) =
\psi(X)\beta$], we have that (2.11) with $M_{LL} = 0$ results in $\hat{\beta} = \hat{\beta}^{(\mathrm{ols})} = (X_L'X_L)^{-1}X_L'Y_L$ (i.e., the
supervised ordinary least squares estimate), which is consistent with Culp and Michailidis
(2008a).
a sequence of smoothers defined on X and G, analogous to the supervised
setting [Hastie and Tibshirani (1990)]. For the second step, the solution to 
the optimization problem technically exists whenever the final matrix R is 
a transductive smoother. Thus, with well defined smoothers one can always 
follow this approach to obtain (2.10). 

2.1.1. Obtaining transductive smoothers for graph views. We elaborate 
next on how to obtain the smoother matrix S from an observed graph view 
G. The graph is represented by its weighted $n \times n$ adjacency matrix $A$ defined
above, with $A(i,j) \ge 0$. The normalized right stochastic smoother matrix $S$
is then defined by $S = D^{-1}A$, with $D$ being a diagonal matrix containing
the row sums (node degrees) of $A$. Notice that the matrices $A$, $D$ and $S$
all admit the necessary partition structure given in (2.9). For example, the
weighted similarity adjacency matrix $A$ has four blocks: the $A_{LL}$ block provides
weighted links between labeled observations, the $A_{UL}$ and $A_{LU}$ blocks
are weighted links between labeled and unlabeled observations, and the $A_{UU}$
block is comprised of weighted links between unlabeled and unlabeled observations.
For the document classification example, the weights are defined
by document similarity between observations in $L \cup U$.

In the Corollary to Proposition 2 in Culp and Michailidis (2008a), we
established that $\rho(S_{UU}) < 1$ whenever $A_{UL}\mathbf{1}_L > 0$ and $A_{UU}$ is irreducible
(i.e., these assumptions are sufficient for $S \in \mathcal{A}[G]$). The condition on $A_{UL}$
is interpreted as requiring that each unlabeled node have at least one connection to a
labeled node. In data involving graphs it should be noted that this condition
is difficult to satisfy, especially when the size of the labeled set is small
relative to that of the unlabeled set, as illustrated in Figure 1. To account
for this, one could start with the observed graph, compute the shortest path
distance between nodes and obtain a new complete graph on each component
(all nodes are connected to all other nodes within each component), and
then, subsequently, "thin" the obtained graph. Notice that these additional
steps circumvent the problem by relaxing the condition to the requirement that
every disconnected component contain at least one labeled node, as discussed in
Culp and Michailidis (2008a).

The above smoother is generally too simplistic to perform well on real
data since there is no tuning parameter $\lambda$. One way to address this is to
define the combinatorial Laplacian as $P = D - A$ and then, subsequently,
generalize the smoother to $S = (A + \lambda P)^{-1}A$ or the symmetric smoother
$S = (I + \lambda P)^{-1}$ in (2.7). Either of these generalizations tends to improve
performance over that of the stochastic smoother since the parameter $\lambda$ can
be estimated based on the response $Y_L$.
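
A small Python sketch may help make the graph smoothers and the spectral radius condition concrete. The helper names (`graph_smoother`, `spectral_radius`, `adjacency`) and the four-node graphs are assumptions chosen for illustration. Node 0 is the only labeled node; in the first graph all nodes lie on one path, while in the second the unlabeled nodes 2 and 3 form a component with no labeled node, so the condition $\rho(S_{UU}) < 1$ fails.

```python
import numpy as np

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def graph_smoother(A, lam=1.0, symmetric=False):
    """S = (A + lam P)^{-1} A, or the symmetric S = (I + lam P)^{-1}, with P = D - A."""
    n = A.shape[0]
    P = np.diag(A.sum(axis=1)) - A             # combinatorial Laplacian
    if symmetric:
        return np.linalg.inv(np.eye(n) + lam * P)
    return np.linalg.solve(A + lam * P, A)

def adjacency(edges, n):
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

n, n_l = 4, 1                                   # node 0 is labeled, nodes 1-3 are not
A_conn = adjacency([(0, 1), (1, 2), (2, 3)], n)   # a single path
A_split = adjacency([(0, 1), (2, 3)], n)          # nodes 2, 3 cut off from the label

S_conn = graph_smoother(A_conn, lam=1.0)        # lam = 1 gives D^{-1} A
S_split = graph_smoother(A_split, lam=1.0)
S_sym = graph_smoother(A_conn, lam=2.0, symmetric=True)

rho_conn = spectral_radius(S_conn[n_l:, n_l:])   # below one
rho_split = spectral_radius(S_split[n_l:, n_l:]) # equal to one: no unique fixed point
rho_sym = spectral_radius(S_sym[n_l:, n_l:])     # below one
print(rho_conn, rho_split, rho_sym)
```

In the connected example only node 1 is adjacent to the labeled node, yet the spectral radius is still below one, in line with the relaxed requirement that every component carry at least one label.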

Fig. 1. A cross-section of an observation graph with both labeled and unlabeled vertices
(observations). The labeled vertices consist of either a large dark or light circle, and the
unlabeled vertices consist of small black circles. The interest in this example is to illustrate
the effect of unlabeled data on the topology of the graph.

2.1.2. Obtaining transductive smoothers for feature views. In the case of
feature data, it is fairly straightforward to obtain the transductive smoother
from the penalty matrix as given in (2.7). However, additional considerations
are made for the case of constructing transductive smoother matrices
based on kernel functions, which is the approach primarily used in this work.
Specifically, one can construct a similarity matrix $W$, with

(2.13)    $W_{ij} = K_\gamma(d(x_i, x_j))$ with $i, j \in L \cup U$,

where $d(\cdot,\cdot)$ is a distance function applied to the vectors
containing the data for observations $i$ and $j$, and $K_\gamma$ is a kernel
function. The corresponding smoother is then given by
$S = (W + \lambda P)^{-1} W$, with $P$ the combinatorial Laplacian of $W$. In
the case of $\lambda = 1$, by construction, we have that
$S \in \mathcal{A}[X]$ whenever $W_{UL} 1_L > 0$ and $W_{UU}$ is
irreducible, as with the graph case. However, unlike in the graph case, this
condition is typically satisfied in practice when using noncompact kernel
functions. On the other hand, performance can improve by introducing a
parameter $k$ and constructing the adjacency matrix $A$ from the $k$ nearest
neighbors defined by the similarity associations in $W$ (often referred to
as a $k$-NN graph). As a result, the adjacency matrix may require similar
modifications as indicated for the observed graph to form the smoother
$S = (A + \lambda P)^{-1} A$ [or $S = (I + \lambda P)^{-1}$], where $P$ is
now redefined as the combinatorial Laplacian operator defined on the $k$-NN
graph $A$.
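
The construction of a kernel similarity matrix, its $k$-NN thinning and the combinatorial Laplacian can be sketched as follows. This is an illustration only: the exponential kernel $e^{-d/\gamma}$, the Euclidean distance and the symmetrization by the maximum are assumptions made here, and the function names are hypothetical.

```python
import math


def kernel_similarity(X, gamma):
    """W[i][j] = exp(-d(x_i, x_j) / gamma) with Euclidean d (kernel assumed here)."""
    n = len(X)
    return [[math.exp(-math.dist(X[i], X[j]) / gamma) for j in range(n)]
            for i in range(n)]


def knn_adjacency(W, k):
    """Keep each node's k most similar neighbors, then symmetrize with the maximum."""
    n = len(W)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: W[i][j], reverse=True)
        for j in order[:k]:
            A[i][j] = W[i][j]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = max(A[i][j], A[j][i])
    return A


def laplacian(A):
    """Combinatorial Laplacian P = D - A, with D the diagonal matrix of row sums."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]


# Four points on a line; the last one is far from the others.
X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
A = knn_adjacency(kernel_similarity(X, gamma=1.0), k=1)
P = laplacian(A)
```

Symmetrizing by the maximum keeps an edge whenever either endpoint selects the other. A stricter "mutual" $k$-NN rule would use the minimum instead and generally produces sparser graphs with more disconnected components.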

2.1.3. Parameter estimation. As noted above, in several instances the
proposed framework requires the estimation of tuning parameters, including
kernel parameters, nearest neighbor parameters and Lagrangian parameters. In
general, the tuning parameters can be broken into two groups: within view
and between view ones. The within view parameters, denoted by $\tau$, are
necessary to construct the penalty matrix $P_i$ from either $X$ data sources
or graph sources [i.e., $\tau_i = (\gamma_i, k_i)$ to form the $k$-NN graph
for view $i$]. The between view parameters correspond to the Lagrangian on
the penalty matrices for the additive model and are denoted as
$\lambda = (\lambda_1, \ldots, \lambda_q)$.

Upon completing the within view parameter estimation step, the goal is
to obtain a penalty matrix, $P_i$, from each view $i$. This problem is
treated on a view-by-view basis. For example, suppose it has been
established that a random forest learner works particularly well in view 1
and, similarly, that a neural net learner is the best performing one in
view 2. To incorporate this information, we consider a search over the
parameter space which finds a transductive smoother $S_i$ that predicts
similarly to the learner. More precisely, let $\phi_{L_i}$ and $\phi_{U_i}$
be the predictions of the procedure applied to view $i$ (e.g., a random
forest or a neural net) and define the penalty matrix $P_i = D_i - W_i$
using the following criterion:

(2.14)    $\min_{\gamma_i, k_i} \|\phi_{U_i} - (I - S_{UU})^{-1} S_{UL} Y_L\|_2^2$ with $S = D_i^{-1} W_i$.

In other words, the goal is to find values $\gamma_i$ and $k_i$ such that
the solution involving smoother $S$ in (2.10) coincides closely with the
predictions from learner $\phi$. Notice that the individual smoothers mimic
specifically chosen learners within each view, which can result in strong
performance in several data applications [refer to the pharmacology data
example in Section 3.3, where we observe that the estimate with (2.14)
performs quite strongly compared to the state-of-the-art co-training
algorithms with random forest].

Upon obtaining the appropriate penalty matrices, one then must deal
with the between view parameter estimation problem for $\lambda$ in (2.5).
The Lagrangian allows one to simultaneously account for the smoother's
contribution for each view. To perform this, we first make use of the fact
that the labeled estimate is linear in $Y_L$; that is,
$\hat{Y}_L = M_{LL}(\lambda) Y_L$, where
$M_{LL}(\lambda) = R_{LL}(\lambda) + R_{LU}(\lambda)(I - R_{UU}(\lambda))^{-1} R_{UL}(\lambda)$
(the notation is modified to indicate the dependence of the smoother
matrices on the parameters). In this case, we make use of the standard GCV
criterion adjusted to transductive smoothers given by

(2.15)    $\mathrm{tGCV}(M_{LL}(\lambda)) = \min \ldots$

In practice, the tGCV criterion is optimized simultaneously for each
$\lambda_j$ so that adjustments can be made between the views.

Remark 2. The optimization criterion in (2.15) can also be used to
estimate within view parameters when a learner is not available. For
example, the parameter $k$ on an observed graph view could be estimated with
(2.15). However, in our experience, if a learner is available, then
performance can improve by using (2.14).
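
To make the between view step concrete, the sketch below scores a labeled-block smoother with a GCV-type criterion and picks a Lagrangian value from a grid. The exact expression in (2.15) is not reproduced here; the code assumes the standard GCV form, $(\mathrm{RSS}/m)/(1 - \mathrm{tr}(M)/m)^2$ with $m = |L|$, and, for brevity, uses the symmetric smoother $(I + \lambda P)^{-1}$ on a labeled-only graph in place of $M_{LL}(\lambda)$. All function names are hypothetical.

```python
def inverse(M):
    """Gauss-Jordan inverse with partial pivoting (adequate for small dense matrices)."""
    n = len(M)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def symmetric_smoother(P, lam):
    """S = (I + lam * P)^{-1} for a penalty matrix P."""
    n = len(P)
    return inverse([[float(i == j) + lam * P[i][j] for j in range(n)]
                    for i in range(n)])


def tgcv(M, y):
    """Assumed GCV form: (RSS / m) / (1 - tr(M) / m)^2 on the labeled responses."""
    m = len(y)
    fitted = [sum(M[i][j] * y[j] for j in range(m)) for i in range(m)]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    trace = sum(M[i][i] for i in range(m))
    return (rss / m) / (1.0 - trace / m) ** 2


def select_lambda(P, y, grid):
    """Return the grid value with the smallest assumed tGCV score (grid must exclude 0)."""
    return min(grid, key=lambda lam: tgcv(symmetric_smoother(P, lam), y))
```

The grid must exclude $\lambda = 0$, since the smoother is then the identity and the denominator vanishes. With several views, the same score would be minimized jointly over $(\lambda_1, \ldots, \lambda_q)$ rather than one coordinate at a time.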
2.1.4. View interaction terms. Next we elaborate on the view interaction
terms provided in (2.1). We restrict attention to two-way interaction terms,
which prove most useful in practice. There are three possible types of
interactions in the present setting: an interaction between two views
comprised of feature data, $f_{12}(X_1, X_2)$, an interaction between two
views comprised of observed graphs, $f_{12}(G_1, G_2)$, and, finally, an
interaction between a feature and a graph view, $f_{12}(X_1, G_2)$.

In this work we are primarily interested in the case when the views are
modeled as transductive kernel or graph smoothers. The interaction term can
be defined as a composite graph operation, which can be achieved in various
ways. One possibility is to use the intersection, and another is to use the
union, of the two underlying graphs. The intersection between two graphs
$G_i$, $G_j$ with corresponding weighted adjacency matrices $W_i$, $W_j$ is
defined by a new graph whose adjacency matrix is given by
$[W_{ij}](u,v) = \sqrt{W_i(u,v) W_j(u,v)}$, while the adjacency matrix
$W_{ij}$ for their union is computed additively from $W_i$ and $W_j$.^ These
terms are then processed as additional smoothers in (2.8).
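
The two composite graph operations can be sketched as elementwise operations on the weighted adjacency matrices. The intersection follows the geometric-mean form given above. The union is additive in the text, and the specific choice of the entrywise average below is an assumption made for illustration; function names are hypothetical.

```python
import math


def graph_intersection(Wi, Wj):
    """Entrywise geometric mean: an edge survives only if it is present in both graphs."""
    return [[math.sqrt(a * b) for a, b in zip(ri, rj)] for ri, rj in zip(Wi, Wj)]


def graph_union(Wi, Wj):
    """Additive union, taken here as the entrywise average (an assumed normalization)."""
    return [[(a + b) / 2.0 for a, b in zip(ri, rj)] for ri, rj in zip(Wi, Wj)]


# Two graphs on three nodes: edge (0, 1) is in both, edge (1, 2) only in the first.
W1 = [[0.0, 4.0, 0.0], [4.0, 0.0, 2.0], [0.0, 2.0, 0.0]]
W2 = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
W_and = graph_intersection(W1, W2)
W_or = graph_union(W1, W2)
```

The intersection zeroes any edge missing from either view, so the interaction smoother only links observations that both views consider similar. The union keeps every edge that appears in at least one view.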

Remark 3. Another approach for defining interactions is given in
Hastie and Tibshirani (1990) where the authors employ restrictions on the
$f_{12}(\cdot,\cdot)$ term during estimation. Extensions of this approach to
(2.1) could also be considered, especially for interactions with nonkernel
based terms (e.g., linear terms).

2.2. Fitting the model in classification. In classification we assume that
the response takes on binary values, $Y \in \{0, 1\}$. The goal is to fit a
general semi-supervised multi-view model of the form

$\eta = \alpha + f_1(X) + f_2(G)$,

with $\eta = g(\mu)$, where $g$ is the logit link function. Next we utilize
the generalized fixed-point self-training strategy to achieve this
objective.

As previously discussed, one must first obtain the training response as
$Y_{Y_U} = [Y_L', Y_U']'$, where $g(Y_U) \in \mathbb{R}^{|U|}$. For the
first step in (2.2), we optimize for $\eta$ in

(2.16)    $\min_{\eta} L(Y_{Y_U}, g^{-1}(\eta)) + \sum_{j=1}^{2} \lambda_j f_j' P_j f_j$,



^There are other possibilities for defining the union term; however, we
compute it additively [Gould (1998)]. It should be noted that when
interaction terms are included in the model, care should be taken to avoid
identifiability issues arising from the following situation:
$f_{12}(G_1 \cup G_2) \approx f_1(G_1) + f_2(G_2) - f_{12}(G_1 \cap G_2)$.

where the logistic loss function [$L(y,p) = -y\log(p) - (1-y)\log(1-p)$ with
$p = g^{-1}(\eta)$] is used and $P_j$ are the appropriate penalty matrices
for $X$, $G$, respectively. The solution to this problem,
$\hat\eta = \hat\eta(Y_U)$, must satisfy

$\lambda_j P_j \hat f_j - (Y_{Y_U} - g^{-1}(\hat\eta)) = 0$.

Employing a Taylor expansion of $g^{-1}(\eta)$ about $\eta^{(k-1)}$ and
setting the result to zero, we get a semi-supervised extension of the
z-scoring algorithm in which the smoother is given by
$S_i^{(k-1)} = (W^{(k-1)} + \lambda_i P_i)^{-1} W^{(k-1)}$, the score by
$z^{(k-1)} = \eta^{(k-1)} + (W^{(k-1)})^{-1}(Y_{Y_U} - g^{-1}(\eta^{(k-1)}))$
and the variance function by
$W^{(k-1)} = \nabla g^{-1}(\eta)|_{\eta = \eta^{(k-1)}}$. It is easy to see
that the above formulation reduces to an application of the Gauss-Seidel
algorithm for each $z^{(k)}$. Unlike the regression setting, the smoother
for the solution depends on $Y_U$, since $W$ depends on $Y_U$, and hence, we
require that there exist an $R(Y_U)$ such that
$\hat\eta(Y_U) = R(Y_U) z(Y_U)$. Now, if $W$ is diagonal, then
$z_U(Y_U) = \hat\eta_U + W_{UU}^{-1}(Y_U - g^{-1}(\hat\eta_U))$. From the
fixed point step (2.3), we get that $g(Y_U) = \hat\eta_U(Y_U) = z_U$. The
iterative self-training algorithm discussed above must be applied in order
to obtain this fixed point. The following proposition provides a sufficient
condition for the algorithm to converge independent of its initialization
(in this case the fixed point must be unique):

Proposition 1. Assume that the solution $\eta = \hat\eta$ exists and
satisfies

$\lambda_j P_j \hat f_j - (Y_{Y_U} - g^{-1}(\hat\eta)) = 0$

for any $Y_U$ such that $g(Y_U) \in \mathbb{R}^{|U|}$. Assume, additionally,
that there exists a $Y_U$ that satisfies the fixed point solution to (2.3),
i.e., $g(Y_U) = \hat\eta_U(Y_U)$. If

with $S_i = (\lambda_i P_i + W)^{-1} W$ and
$W = \nabla g^{-1}(\eta)|_{\eta = \hat\eta}$, then the iterative
self-training method converges independent of initialization.

The condition on the above matrix is not of much practical use, but, 
nevertheless, it provides a general setting for which the solution to (2.3) 
uniquely exists. 
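
The score and variance function used by the z-scoring step of Section 2.2 have a simple closed form for the logit link. The sketch below computes them elementwise, assuming a diagonal $W$ and the standard local-scoring working response $z = \eta + W^{-1}(y - g^{-1}(\eta))$; function names are hypothetical.

```python
import math


def inv_logit(eta):
    """g^{-1}(eta) for the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def working_response(eta, y):
    """Return (z, w): w = p(1 - p) is the derivative of g^{-1}; z = eta + (y - p) / w."""
    z, w = [], []
    for e, yi in zip(eta, y):
        p = inv_logit(e)
        wi = p * (1.0 - p)
        w.append(wi)
        z.append(e + (yi - p) / wi)
    return z, w


# At a fixed point the unlabeled response equals its own fitted probability,
# so the working response collapses to the linear predictor itself.
eta_u = [1.2, -0.4]
z_u, _ = working_response(eta_u, [inv_logit(e) for e in eta_u])
```

The last two lines illustrate the fixed point relation $g(Y_U) = \hat\eta_U = z_U$: when $Y_U = g^{-1}(\hat\eta_U)$ the correction term vanishes, whatever the value of $W$.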

2.3. Model selection issues. Given the multi-view model (2.1) developed
above, the next important issue to address is that of model selection,
especially in the presence of multiple views and their interactions. We
present next a criterion for achieving this objective. To start, for ease of
presentation consider a model involving a feature data view $X$ and a graph
view $G$. In this work the interest is in view-nested model comparisons; for
example, to assess the significance of a view interaction term, compare

(2.17)    $\eta = \alpha + f_1(X) + f_2(G) + f_{12}(X, G)$,

(2.18)    $\eta = \alpha + f_1(X) + f_2(G)$.

An example of such a model comparison is illustrated with the CORA text
data (Section 3.2).

The generalized fixed point self-training framework provides a natural
environment to assess model selection in the multi-view setting. For
example, let $\hat\eta_1 = R_1 z_1$ and $\hat\eta_2 = R_2 z_2$ represent two
model estimates. One approach to compare two smoothers is Akaike's
Information Criterion (AIC). AIC compares two models by penalizing the loss
of the model under consideration with the degrees of freedom of the smoother
[$\mathrm{tr}(M_{LL_j})$], where the labeled estimate $M_{LL_j} z_L$ is
linear in $z_L$ for model $j$. Models with lower AIC generally fit better
and are less complex than models with larger AICs [Hastie and Tibshirani
(1990), Hastie, Tibshirani and Friedman (2001)]. The AIC model comparison
for transductive smoothers is formally given by

(2.19)    $\mathrm{tAIC}(M_{LL_j}) = \frac{2}{m}\mathrm{Loss}(M_{LL_j}) + \frac{2\,\mathrm{tr}(M_{LL_j})}{m}$,

where $\mathrm{Loss}(M_{LL_j})$ is $L(Y_L, g^{-1}(M_{LL_j} z_L))$ for some
loss function $L$ (for this work we use the logistic loss function). The
best model is the one corresponding to the smoother that minimizes tAIC.
Also, when optimizing tAIC we preserve the hierarchical constraint, that is,
the presence of higher level interaction terms requires lower level terms in
the model.
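
A small sketch of how such a criterion can be used to compare view-nested models is given below. It assumes that both the loss and the degrees of freedom are scaled by $2/m$, with $m$ the number of labeled observations; this normalization, the toy fitted probabilities and the function names are assumptions made for illustration only.

```python
import math


def logistic_loss(y, p):
    """Summed logistic loss over the labeled observations."""
    return sum(-(yi * math.log(pi) + (1 - yi) * math.log(1 - pi))
               for yi, pi in zip(y, p))


def taic(y, p, df):
    """Assumed form: (2 / m) * loss + (2 / m) * df, with df = tr(M_LL)."""
    m = len(y)
    return 2.0 * logistic_loss(y, p) / m + 2.0 * df / m


def best_model(y, candidates):
    """candidates maps a model name to (fitted probabilities, degrees of freedom)."""
    return min(candidates, key=lambda name: taic(y, *candidates[name]))


# Hypothetical fits: the interaction model fits slightly better but uses more df.
y_l = [1, 0, 1, 0]
fits = {
    "T+C": ([0.8, 0.2, 0.7, 0.3], 2.0),
    "T+C+T*C": ([0.82, 0.2, 0.7, 0.3], 3.5),
}
chosen = best_model(y_l, fits)
```

In this toy comparison the small gain in fit does not offset the extra degrees of freedom, so the main effects model is retained, which mirrors the kind of conclusion drawn from tAIC in the data examples.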



2.4. Implementation issues. The proposed fitting procedures employ either
the Gauss-Seidel algorithm directly, or indirectly through the z-scoring
method. However, for regression it is computationally advantageous to employ
an iterative backfitting procedure, as opposed to solving the Gauss-Seidel
equations directly [Hastie and Tibshirani (1990)]. The main idea behind
backfitting is to iteratively smooth each function against the partial
residual without that function and subsequently mean center the function. A
generalization to the local scoring algorithm is applicable to the
generalized additive model setting. Convergence of both algorithms is
discussed globally for several possible smoother choices [Buja, Hastie and
Tibshirani (1989), Hastie and Tibshirani (1990)]. The transductive local
scoring algorithm is presented in Algorithm 1 and can be used to fit all the
models under consideration; note that iterative backfitting is a special
case of it.

Algorithm 1 Transductive Local Scoring

1: Initialize vector $Y_U^{(0)}$ of size $n$, and select a smoother type for
   each view (e.g., kernel, linear, spline, etc.). Input tolerance
   $\delta > 0$.
2: for $k = 1, \ldots$ do
3:   Apply the local scoring algorithm introduced in Buja, Hastie and
     Tibshirani (1989) with the smoother type specified and response
     $Y_{Y^{(k-1)}} = [Y_L', Y_U^{(k-1)\prime}]'$ to determine the estimate
     $\hat\eta^{(k)}$.
4:   Update $Y_U^{(k)} = g^{-1}(\hat\eta_U^{(k)})$ and hence
     $Y_{Y^{(k)}} = [Y_L', Y_U^{(k)\prime}]'$.
5:   Stop if $\|\hat\eta_U^{(k)} - \hat\eta_U^{(k-1)}\| < \delta$.
6: end for

The above algorithm tends to globally converge in practice, but the rate of
convergence depends significantly on the choice of $Y_U^{(0)}$. The
following argument provides a fairly convenient initialization for this
procedure. Consider the regression problem in (2.5) with response $Y_{Y_U}$.
Previously, we solved for $\hat\eta(Y_U) = R Y_{Y_U}$ and obtained the fixed
point directly. Now, instead, we apply (2.7) to get that
$\hat f_i(Y_U) = S_i \epsilon_i(Y_U)$, where
$\epsilon_i(Y_U) = Y_{Y_U} - \hat\eta^{(-i)}(Y_U)$ is the partial residual,
and solve to get that
$\hat\eta_U(Y_U) = S_{UL_i}\epsilon_{L_i} + S_{UU_i} Y_U + (I - S_{UU_i})\hat\eta_U^{(-i)}(Y_U)$.
Invoke step (2.3) so that $Y_U = \hat\eta_U(Y_U)$, and from this we get that
$Y_U = (I - S_{UU_i})^{-1} S_{UL_i}\epsilon_{L_i} + \hat\eta_U^{(-i)}$.
Substituting this for $Y_U$ in (2.7), the dependence on $Y_U$ cancels and
yields $\hat f_i = M_{\cdot L_i}\epsilon_{L_i}$, where $M_{LL_i}$ and
$M_{UL_i}$ are the smoothers identified in (2.10) for view $i$. From this,
one can then apply backfitting directly on the centered smoothers $M_{LL_i}$
with response $Y_L$ to obtain an estimate $Y_L^{(0)}$. To obtain the
estimate $Y_U^{(0)}$, one could predict with the backfitting algorithm using
smoothers $M_{UL_i}$, response $Y_L$ and the previous estimates $Y_L^{(0)}$.
The initialization $Y_U^{(0)}$ tends to be rather close to the solution from
the self-training algorithm with centered smoothers and response $Y_{Y_U}$,
and hence, the algorithm converges fairly fast. A similar initialization can
be derived for the more general local scoring version.
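
The cancellation used in the initialization argument above can be written out step by step. The following derivation is a sketch that uses only the block partition of $S_i$ and the fixed point condition; the final expressions are taken to be the smoothers referred to in (2.10), which is an assumption since that display is not reproduced here.

```latex
\begin{align*}
\hat f_i(Y_U) &= S_i\,\epsilon_i(Y_U), \qquad
  \epsilon_i(Y_U) = Y_{Y_U} - \hat\eta^{(-i)}(Y_U), \\
\hat\eta_U(Y_U) &= S_{UL_i}\epsilon_{L_i}
  + S_{UU_i}\bigl(Y_U - \hat\eta_U^{(-i)}\bigr) + \hat\eta_U^{(-i)}
  = S_{UL_i}\epsilon_{L_i} + S_{UU_i}Y_U + (I - S_{UU_i})\hat\eta_U^{(-i)}, \\
Y_U = \hat\eta_U(Y_U) \;&\Longrightarrow\;
  (I - S_{UU_i})\,Y_U = S_{UL_i}\epsilon_{L_i} + (I - S_{UU_i})\hat\eta_U^{(-i)} \\
  &\Longrightarrow\;
  Y_U = (I - S_{UU_i})^{-1}S_{UL_i}\epsilon_{L_i} + \hat\eta_U^{(-i)}, \\
\epsilon_{U_i} &= Y_U - \hat\eta_U^{(-i)}
  = (I - S_{UU_i})^{-1}S_{UL_i}\epsilon_{L_i}, \\
\hat f_{L_i} &= S_{LL_i}\epsilon_{L_i} + S_{LU_i}\epsilon_{U_i}
  = \bigl[S_{LL_i} + S_{LU_i}(I - S_{UU_i})^{-1}S_{UL_i}\bigr]\epsilon_{L_i}
  = M_{LL_i}\,\epsilon_{L_i}, \\
\hat f_{U_i} &= S_{UL_i}\epsilon_{L_i} + S_{UU_i}\epsilon_{U_i}
  = \bigl[I + S_{UU_i}(I - S_{UU_i})^{-1}\bigr]S_{UL_i}\epsilon_{L_i}
  = (I - S_{UU_i})^{-1}S_{UL_i}\,\epsilon_{L_i}
  = M_{UL_i}\,\epsilon_{L_i}.
\end{align*}
```

The last line uses $I + S(I - S)^{-1} = (I - S)^{-1}$. Both blocks of $\hat f_i$ therefore depend on the data only through the labeled partial residual $\epsilon_{L_i}$, which is why backfitting can be run on the labeled smoothers alone.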

Remark 4. When either backfitting or local scoring is employed for
function estimation, one must approximate the degrees of freedom used for
measures such as tGCV and tAIC; for example, a common approximation for the
denominator of (2.15) is
$(1 - [1 + \sum_i (\mathrm{tr}(M_{LL_i}) - 1)]/|L|)^2$
[Hastie and Tibshirani (1990)].

3. Data examples. To assess the performance and usefulness of the proposed
model, we have selected three diverse data examples. In the first example a
synthetic data set comprised of two graph views is examined. In the second
example an observed graph is combined with a feature set obtained from text
data. In the third example we apply our approach to a pharmaceutical
problem.

Fig. 2. The grid of data given on the left has a two-class response
represented by the light gray and dark gray points. We consider connecting
points using square neighborhoods and diagonal neighborhoods, which
correspond to the lattice (right panel).

3.1. Graph selection with disjoint lattices. We consider data from a
two-dimensional lattice comprised of 625 nodes on a 25 × 25 grid (refer to
Figure 2). The example consists of two separate simulated complex response
patterns on the graphs as shown in Figure 3 (row 1). For each response
configuration, we allow for two different ways to connect neighboring nodes:
square and diagonal (refer to Figure 2). Let $L_s$ be the adjacency matrix
for the square neighborhood lattice and $L_d$ for the diagonal neighborhood
lattice. The following model was considered for each response configuration
(checkerboard or mixed):

(3.1)    $\eta = \alpha + f_s(L_s) + f_d(L_d)$.

The objective was to determine which graphs were important for predicting 
the response in each configuration using classification accuracy and the tAIC 
measure. 

Fig. 3. The response configurations are presented in row 1. The accuracy
for each response configuration is given in row 2, while the tAIC for each
response configuration is given in row 3. S denotes the square model, D
denotes the diagonal model, and S+D denotes the additive model.

In the analysis we first sampled 10% of the observations in the lattice to
be treated as labeled nodes (cases), while the remaining 90% were treated
as unlabeled cases. The weight matrix for each lattice configuration (square
and diagonal) was constructed by $W_{ij} = K_\gamma(d_{sp}(i,j))$, with
$K_\gamma(d) = e^{-d/\gamma}$, where $d_{sp}(i,j)$ denotes the shortest path
from node $i$ to node $j$ on the lattice (note $i, j \in L \cup U$). The
penalty matrix for the lattice was given by $P = D - W$, with $D$ the
diagonal row sum matrix of $W$. In the estimation of
$\eta = \alpha + f_\ell(L_\ell)$ with $\ell \in \{s, d\}$, the smoothers
were given by $S_\ell = (W_\ell + P_\ell)^{-1} W_\ell$ and the parameter
$\gamma_\ell$ was estimated using the tGCV criterion. For the additive
model, $\eta = \alpha + f_s(L_s) + f_d(L_d)$, the penalty matrices
$P_\ell$, the kernel matrices $W_\ell$ and the parameter vector $\lambda$
were supplied to the local scoring algorithm as input. However, before
proceeding with local scoring, the $\lambda$ parameters were estimated
simultaneously using the tGCV criterion analogous to the approach discussed
in Hastie and Tibshirani (1990).^ This process was repeated 50 times. Then,
the entire analysis was executed with $Q$% of the data treated as labeled
($Q = 20, 30, \ldots, 90$). The average accuracy (second row) and tAIC
(third row) for each labeled partition size are reported in Figure 3.
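
The two lattice graphs and their shortest path kernel weights can be generated with a short script. The sketch below assumes that the square neighborhood links each node to its horizontal and vertical neighbors, that the diagonal neighborhood links it to its four diagonal neighbors only, and that the kernel is $e^{-d/\gamma}$; these choices and all names are illustrative assumptions.

```python
import math
from collections import deque

SQUARE = [(0, 1), (1, 0), (0, -1), (-1, 0)]
DIAGONAL = [(1, 1), (1, -1), (-1, 1), (-1, -1)]


def lattice_neighbors(size, offsets):
    """Neighbor lists for a size x size grid; node (r, c) has index r * size + c."""
    nbrs = [[] for _ in range(size * size)]
    for r in range(size):
        for c in range(size):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < size and 0 <= cc < size:
                    nbrs[r * size + c].append(rr * size + cc)
    return nbrs


def shortest_path_lengths(nbrs, source):
    """Breadth-first search distances from `source`; unreachable nodes get infinity."""
    dist = [math.inf] * len(nbrs)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in nbrs[u]:
            if dist[v] == math.inf:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def kernel_weights(dist, gamma):
    """One row of W: exp(-d / gamma), which is 0 for unreachable nodes."""
    return [math.exp(-d / gamma) for d in dist]


square = lattice_neighbors(25, SQUARE)
diagonal = lattice_neighbors(25, DIAGONAL)
d_square = shortest_path_lengths(square, 0)
d_diagonal = shortest_path_lengths(diagonal, 0)
```

Under the diagonal-only assumption the diagonal lattice splits into two components, since a diagonal step preserves the parity of the row plus column index. Node 1 is therefore unreachable from node 0 and receives weight 0, so every component needs labeled nodes for the conditions of Section 2.1.1 to hold.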

The accuracy plot for the checkerboard example illustrates that the model 
with square neighborhoods exhibited the best performance. In the graph 
selection plot the square neighborhood graph provided the optimal model 
in terms of smallest tAIC. Therefore, the model suggests that the square 
neighborhood view fits the data well, in accordance with the underlying 
checkerboard configuration. For the mixed configuration, the additive model 
performed marginally better than the other configurations. However, as the 
size of the labeled data grew, the square and diagonal graphs minimized 
tAIC. 

3.2. Text analysis. In our next example we consider 776 documents obtained
from the artificial intelligence/machine learning segment of the CORA text
data [McCallum et al. (2000)]. The artificial intelligence segment consists
of text documents that address the general topics of machine learning,
planning, theorem proving, robotics, expert systems and others. The binary
response is the indicator that the text document is specifically about
machine learning. The first view corresponds to the co-citation network,
where the vertices are the labeled and unlabeled text documents, and the
edge weights are the number of times that each pair of observations agree in
citation (co-citation). Specifically, the adjacency matrix is constructed as

$A_{ij} = \begin{cases} \text{total number of documents co-cited with } i, & i = j, \\ \sum I\{i \text{ and } j \text{ cite the same document}\}, & i \neq j. \end{cases}$

The second view contains 141 carefully parsed words used in the title of
each of the text documents (e.g., learn, net, theory, etc.). The text string
is a partial match where the first letters of the word in the title must
match the variable; for example, if the variable is net, then the
observation represents a count of any variation of net, nets, network, etc.
The following logistic model was used:

(3.2)    $\eta = \alpha + f_1(X_{title}) + f_2(G_{cite}) + f_{12}(X_{title}, G_{cite})$.
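
The construction of the two CORA views described above can be sketched on toy inputs. The title view counts title words that begin with each variable, and the co-citation view counts, for each pair of documents, the cited documents they share. The tokenization rule, the handling of the diagonal (left at zero here) and the function names are assumptions made for illustration.

```python
import re


def prefix_counts(title, variables):
    """Count title words whose first letters match each variable (partial match)."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return {v: sum(token.startswith(v) for token in tokens) for v in variables}


def cocitation_offdiagonal(cites):
    """A[i][j] = number of documents cited by both i and j (diagonal left at 0)."""
    n = len(cites)
    return [[len(cites[i] & cites[j]) if i != j else 0 for j in range(n)]
            for i in range(n)]


counts = prefix_counts("Learning in Neural Networks: nets and theory",
                       ["learn", "net", "theory"])
A = cocitation_offdiagonal([{"a", "b"}, {"b", "c"}, {"d"}])
```

In the example title, "Networks" and "nets" both match the variable net while "Neural" does not, so the count for net is 2.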



^To speed up computation, transductive backfitting was employed with
response $Y_L$ treated as continuous for only the parameter estimation
component. Transductive local scoring was used for fitting the actual model
in each step.

Fig. 4. (left) The plot provides the average accuracy over 50 samples from
applying the multi-view model on the CORA text data as the amount of labeled
data varies from 10% to 90%. (right) The corresponding tAIC plot is
provided. T and C denote the models constructed using the word view and the
citation view, respectively. T+C is the main effects additive model, and
T+C+T*C is the full model including the interaction between views.



To compute the interaction term, we employ the intersection operation by
defining $[W_{12}]_{ij} = \sqrt{[W_1]_{ij}[A_2]_{ij}}$, where $W_1$ is the
kernel matrix on the title view and $A_2$ is the co-citation adjacency
matrix. The goal is to determine the simplest model that adequately predicts
whether a document is classified as addressing the specific topic of machine
learning in the artificial intelligence network.

As before, the percentage of labeled cases was varied from 10% to 90%.
The average accuracy results, based on 50 replications, are shown in Figure
4 (left panel) and the tAIC results in Figure 4 (right panel). Cosine
dissimilarity was used to construct the distances between observations on
the title view and tGCV was used to estimate the parameters. The accuracy
results tend to favor the models with both the text and citation
information. The tAIC measure indicates that the additive model comprised of
the text and citation views without interaction was the minimizer. From this
result we select the model
$\eta = \alpha + f_1(X_{title}) + f_2(G_{cite})$ as dominant and drop the
interaction term between $X_{title}$ and $G_{cite}$.

Next, we assess the proposed approach against the spectral graph
transducer (SGT), first using the $G_{cite}$ view, and then using both the
$X_{title}$ and $G_{cite}$ views [Joachims (2003)]. In Figure 5, the
SGT(X, G) dominates in the 10-30% configurations and remains competitive for
the larger labeled partitions. The SGT(G), however, is only competitive in
the 10% configuration.

3.3. Drug discovery data. We revisit one of the motivating examples for
this work where the observations correspond to compounds which could 
potentially become drugs. The goal is to assess early in the drug discovery 
process the potential of a compound to cause an adverse event (AE) or 
side-effect. Clearly, a compound's AE status is critically important to its 
success, and as a result, pharmaceutical companies would like to identify 
these compounds as early as possible in the discovery process. The targeted 
compounds can then be modified in an attempt to reduce the adverse event 
status while maintaining their effectiveness, or eliminated from follow-up 
altogether. 

The set of predictors consists of information describing the biological
relationship (view 1) between a compound and a particular target and the
chemical relationship (view 2) which provides descriptors based on the
structure of the compound. In order to obtain the necessary biological
information, a therapeutically relevant concentration of the drug is applied
to the target, and the inhibition of the target's activity is measured. This
view consists of $p_1 = 191$ continuous and noisy variables, each
corresponding to a carefully chosen target. The chemical predictors are
represented by a sparse and binary set of descriptors ($p_2 = 151$), where
each descriptor represents a specific substructure in the compound. For the
data set under consideration, there are $n = 438$ compounds, of which 92 are
known to be associated with a specific AE ($Y = 1$, otherwise $Y = 0$).

Fig. 5. The plot provides the accuracy results with 95% confidence bands
from applying the multi-view model and the spectral graph transducer using
the co-citation view only, and the analogous plots with both the title and
co-citation views. The amount of labeled data varies from 10% to 90%.

Fig. 6. (left) The plot provides the average accuracy over 50 samples from
applying the multi-view model [recall that random forests were employed in
(2.14)] on the pharmacology data as the amount of labeled data varies from
10% to 90%. (right) The corresponding tAIC plot is provided. B and C denote
the models constructed using just the biology view and just the chemistry
view, respectively. B+C is the main effects additive model, and B+C+B*C is
the full model including the interaction between views. The 95% confidence
bands are provided to assess the precision of the contribution for a
particular model.



In addition to generating a predictive model, scientists are also inter- 
ested in assessing the importance of each descriptor set. That is, chemical 
fingerprints (data in view 2) are extremely cheap to obtain, only requiring 
computation time, whereas the biological information takes more time and 
money to generate because each compound needs to be assayed. Hence, as- 
sessing the importance of both views of information has important resource 
implications. To determine the appropriate model for this data, the following
logistic multi-view model was fitted to the data:

$\eta = \alpha + f_1(X_{Bio}) + f_2(X_{Chem}) + f_{12}(X_{Bio}, X_{Chem})$.

The smoothers for each term in the above model were generated by opti- 
mizing (2.14), using random forests in each view. The results are shown in 
Figure 6 for both accuracy and tAIC. In accuracy, the additive model tends 
to improve over that of the chemistry view only model, but the improve- 
ment is not significant. From the tAIC analysis, the additive model with 
both views seems useful for smaller labeled partitions, but as the %-labeled 
increases, its utility diminishes to that of the chemistry view only model. 
This result suggests that the biology view and interactions involving this 
view are not contributing significantly to the performance of this model. 

Fig. 7. (left) The plot provides the accuracy results with 95% confidence
bands from applying the multi-view model with random forest, Co-Training
with RF, the spectral graph transducer, and the random forest without a view
distinction on the pharmacology data as the amount of labeled data varies
from 10% to 90%. (right) The plot provides the analogous results for the
Kappa performance measure, which better accounts for unbalanced classes.

Next, we wish to assess the proposed modeling approach compared to
other multi-view procedures. From the above analysis, the chemistry only
view model is all that is necessary, but we use the biology/chemistry model
without interaction for comparison with other multi-view techniques. In
addition to the multi-view model, a random forest without a view distinction
was fitted to the data [i.e., random forest fit directly to (Bio, Chem)],
and the co-training procedure discussed in the introduction, using random
forests as the base learner [Blum and Mitchell (1998)], and the SGT approach
were also employed on this data. To measure performance, for each labeled
percentage from 10% to 90% we formed 50 partitions of the data, each with
the remainder treated as unlabeled, and applied both accuracy and kappa
measures to the unlabeled data. The kappa measure is defined as
$(O - E)/(1 - E)$, where $O$ is the observed agreement in the testing
confusion matrix and $E$ is the expected agreement [Cohen (1960)]. Values
close to 0 represent poor agreement, while values close to 1 represent
perfect agreement. Because kappa compares observed agreement to expected
agreement, it is helpful for assessing performance for unbalanced data.
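
The difference between accuracy and kappa on unbalanced data can be seen in a few lines. The sketch below computes both from paired true and predicted labels, with the expected agreement $E$ obtained from the marginal class frequencies; the function name is hypothetical.

```python
def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, kappa), with kappa = (O - E) / (1 - E)."""
    n = len(y_true)
    classes = sorted(set(y_true) | set(y_pred))
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    if expected == 1.0:
        # Only one class appears in both vectors; kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1.0 - expected)


# Unbalanced toy data: 2 of 10 compounds carry the adverse event.
truth = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
acc, kappa = accuracy_and_kappa(truth, always_negative)
```

A classifier that never predicts the adverse event reaches 80% accuracy on this toy data but a kappa of 0, because its agreement with the truth is exactly what the marginal frequencies alone would produce.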

The multi-view model with random forest in (2.14) and the co-training 
procedure are quite competitive in both the accuracy and kappa measures 
(refer to Figure 7). The results based on kappa reveal that co-training is 
somewhat more conservative than the multi-view model (i.e., tends to pre- 
dict several observations as not having an AE), and therefore, the multi-view 
model exhibits a strong performance in kappa with a slight deterioration in 
accuracy. The supervised random forest applied without view distinction
and the SGT performed poorly with respect to both measures for this data
set.

4. Concluding remarks. In this paper we developed a modeling frame- 
work suitable for analyzing multi-view data. Its main features are: (i) the 
generalized fixed point framework to fit semi-supervised additive models to 
both observed graph and feature views, (ii) mechanisms to perform view 
selection and incorporate view interactions and (iii) data-driven tuning pa- 
rameter estimation. The proposed framework and subsequent developments 
provide a marked departure from the original co-training algorithms into a 
data analysis setting by allowing view interactions and selections. 

A topic of future study is the ability to assess variable contribution 
within and between views. In this setting, the individual feature views 
are constructed from variables and, therefore, the contribution of a par- 
ticular view depends on that of the underlying variables. On the other 
hand, a variable's contribution may occur at the view level such as in 
an interaction. Other interesting issues of study involve inference, test- 
ing and transductive covariance estimation. The process of predicting new 
observations (as opposed to retraining, which is the current process) is 
also under investigation. The difficulty of this problem, often referred to 
as inductive learning, is noted in several references [Culp and Michailidis
(2008), Krishnapuram et al. (2005), Zhu, Ghahramani and Lafferty (2003),
Wang and Zhang (2006), Zhu (2007)] and is worthy of its own investiga- 
tion. 

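5. Proof of Proposition 1. Next we provide the proof of Proposition 1.

Proposition 1. Assume that the solution $\eta = \hat\eta$ exists and
satisfies

$\lambda_j P_j \hat f_j - (Y_{Y_U} - g^{-1}(\hat\eta)) = 0$

for any $Y_U$ such that $g(Y_U) \in \mathbb{R}^{|U|}$. Assume, additionally,
that there exists a $Y_U$ that satisfies the fixed point solution to (2.3),
i.e., $g(Y_U) = \hat\eta_U(Y_U)$. If

with $S_i = (\lambda_i P_i + W)^{-1} W$ and
$W = \nabla g^{-1}(\eta)|_{\eta = \hat\eta}$, then the iterative
self-training method converges independent of initialization.

Proof. By assumption, we have that

$\lambda_i P_i (f_i^{(k)} - \hat f_i) = (Y_{Y^{(k-1)}} - Y_{Y_U}) - (g^{-1}(\eta^{(k)}) - g^{-1}(\hat\eta))$,

where $\hat\eta = \hat\alpha + \sum_j \hat f_j$ and
$\eta^{(k)} = \alpha^{(k)} + \sum_j f_j^{(k)}$. We also have that

$g^{-1}(\eta^{(k)}) = g^{-1}(\hat\eta) + W(\eta^{(k)} - \hat\eta) + o(1)$,

with $W = \nabla g^{-1}(\eta)|_{\eta = \hat\eta}$. Putting these together
and solving for $f_i^{(k)} - \hat f_i$ brings in the smoother
$S_i = (W + \lambda_i P_i)^{-1} W$, up to an $o(1)$ term. After some algebra
and solving for a common term,

$(I - S_i S_j)(f_i^{(k)} - \hat f_i) = S_i (I - S_j) W^{-1} (Y_{Y^{(k-1)}} - Y_{Y_U}) + o(1)$, $i \neq j$.

Assume $W$ is a multiple of $I$ and define
$R_i = (I - S_i S_j)^{-1} S_i (I - S_j)$; then

$f_{U_i}^{(k)} - \hat f_{U_i} = R_{UU_i}(\eta_U^{(k-1)} - \hat\eta_U) + o(1)$.

From this, we can cycle the above statements with $i = 1, \ldots, q$ for
$\ell = 1, \ldots, k$ to get that

$\eta_U^{(k)} - \hat\eta_U = \left(\sum_{j=1}^{q} R_{UU_j}\right)^{k-1} (\eta_U^{(1)} - \hat\eta_U) + o(1)$,

$f_{U_i}^{(k)} - \hat f_{U_i} = R_{UU_i} \left(\sum_{j=1}^{q} R_{UU_j}\right)^{k-2} (\eta_U^{(1)} - \hat\eta_U) + o(1)$.

Therefore, if $\left[\sum_{j=1}^{q} R_{UU_j}\right]^k \to 0$, then
convergence of the algorithm is guaranteed. The actual initialization
$\eta^{(1)}$ is of no consequence; therefore, the convergence is independent
of initialization. □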